Risk Bounds for CART Classifiers under a Margin Condition 

Servane Gey* 



Abstract 

Risk bounds for Classification And Regression Trees (CART) classifiers are obtained
under a margin condition in the binary supervised classification framework. These risk
bounds are derived conditionally on the construction of the maximal binary tree and make
it possible to prove that the linear penalty used in the CART pruning algorithm is valid
under a margin condition. It is also shown that, conditionally on the construction of the
maximal tree, the final selection by test sample does not dramatically alter the
estimation accuracy of the Bayes classifier.

In the two-class classification framework, the risk bounds obtained by using penalized
model selection validate the CART algorithm, which is used in many data mining
applications in Biology, Medicine or Image Coding for instance.

The Classification And Regression Trees (CART) method proposed by Breiman, Friedman,
Olshen and Stone [9] in 1984 consists in constructing an efficient algorithm that gives a
piecewise constant estimator of a classifier or a regression function from a training
sample of observations. This algorithm is based on binary tree-structured partitions and
on a penalized criterion that selects "good" tree-structured estimators among a huge
collection of trees. It yields easy-to-interpret and easy-to-compute estimators which are
widely used in many applications in Medicine, Meteorology, Biology, Pollution or Image
Coding (see [10], [39] for example). This kind of algorithm is often used when the space
of explanatory variables is high-dimensional. Owing to its recursive structure, CART needs
few computations to provide convenient classifiers, which reduces the computation time
drastically when the number of variables is large. It is now widely used in genetics (see
[16] for example), or more generally to reduce variable dimension (see [33], [26] for
example).



The CART algorithm provides classifiers or regressors represented by binary decision trees.
An example of the latter is given in Figure 1. Suppose we have a pair of covariates
$(X_1, X_2)$ belonging to $[0;1]^2$. The partition is defined recursively by a sequence of
questions asked at each node of the tree: if the answer is positive, go to the left node,
if not, go to the right node. Hence the first question corresponds to a two-part partition
of the covariate space. Then, each part is split into two subparts, and so on. Hence each
node of the tree represents a subset of the covariate space defined by the successive
questions.

Figure 1: Decision tree example (each internal node carries a threshold question on one
covariate, of the form $x_1 < c$ or $x_2 < c$, for instance $x_1 < 0.566982$).



The final partition is given by the leaves of the tree. Finally, a predictive value for the
dependent variable is associated with each leaf.

To construct such a tree from a training sample of observations, the CART algorithm
first constructs a large dyadic recursive tree from the observations by minimizing some
local impurity function at each step. Then, the constructed tree is pruned to obtain a
finite sequence of nested trees thanks to a penalized criterion, whose penalty term is
proportional to the number of leaves. The linearity of the penalty term is fundamental to
ensure that the whole information is kept in the obtained sequence. CART differs from
the algorithm proposed by Blanchard et al. [5] by the fact that the first large tree is
constructed locally, and not in a global way by minimizing some loss function on the whole
sample. For further results on the construction of the large tree, we refer to Nobel [30],
[31] and to Nobel and Olshen [32] about Recursive Partitioning.

In this paper, our concern is the pruning step which entails the choice of the penalty 
function. Gey et al. pT| gave an answer to this question in the regression framework. Fol- 
lowing this previous work, the present paper aims at validating the choice of the penalty 
in the two-class classification framework. In what follows, we establish the link between
the CART algorithm and a model selection procedure, where the collection of models is a 
collection of random decision trees constructed on the training sample of observations. In 
its pruning procedure, CART selects a small collection of trees within the whole collection 
of random trees. Then, a final tree belonging to the small collection is selected either by 
cross-validation or by test sample. The present paper focuses on the test sample method. 
We exhibit risk bounds for the chosen tree under some conditions on the joint distribution 
of the variables. These risk bounds validate the choice of the penalty used in the pruning 
step, and show that the impact of the selection via test sample is conveniently controlled.



The CART method takes place in the following general classification framework. Suppose
one observes a sample $\mathcal{L}$ of $N$ independent copies $(X_1, Y_1), \ldots, (X_N, Y_N)$
of the random variable $(X, Y)$, where the explanatory variable $X$ takes values in a
measurable space $\mathcal{X}$ and is associated with a label $Y$ taking values in $\{0, 1\}$.
A classifier is then any function $f$ mapping $\mathcal{X}$ into $\{0, 1\}$. Its quality is
measured by its misclassification rate

$$P(f(X) \neq Y),$$

where $P$ denotes the joint distribution of $(X, Y)$. If $P$ were known, the problem of
finding an optimal classifier minimizing the misclassification rate would be easily solved
by considering the Bayes classifier $f^*$ defined for every $x \in \mathcal{X}$ by

$$f^*(x) = \mathbb{1}_{\eta(x) > 1/2}, \tag{1}$$

where $\eta(x)$ is the conditional expectation of $Y$ given $X = x$, that is

$$\eta(x) = P[Y = 1 \mid X = x]. \tag{2}$$

As $P$ is unknown, the goal is to construct from the sample
$\mathcal{L} = \{(X_1, Y_1), \ldots, (X_N, Y_N)\}$ a classifier $\tilde f$ that is as close as
possible to $f^*$ in the following sense: since $f^*$ minimizes the misclassification rate,
$\tilde f$ will be chosen in such a way that its misclassification rate is as close as
possible to the misclassification rate of $f^*$, i.e. in such a way that the expected loss

$$l(f^*, \tilde f) = P(\tilde f(X) \neq Y) - P(f^*(X) \neq Y) \tag{3}$$

is as small as possible. Then, the quality of $\tilde f$ will be measured by its risk, i.e.
the expectation with respect to the $\mathcal{L}$-sample distribution

$$R(\tilde f, f^*) = \mathbb{E}[l(f^*, \tilde f)]. \tag{4}$$
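
The definitions (1) to (3) can be checked by hand on a small discrete example. The sketch
below is only an illustration: the three-point covariate space, the marginal `p_x` and the
regression function `eta` are made-up values, not taken from the paper.

```python
# Toy illustration with hypothetical numbers: X takes the values 0, 1, 2.
p_x = [0.5, 0.3, 0.2]   # assumed marginal distribution of X
eta = [0.9, 0.4, 0.7]   # assumed eta(x) = P[Y = 1 | X = x]


def bayes_classifier(x):
    """f*(x) = 1 if eta(x) > 1/2 and 0 otherwise, as in (1)."""
    return 1 if eta[x] > 0.5 else 0


def misclassification_rate(f):
    """P(f(X) != Y), computed exactly from p_x and eta."""
    total = 0.0
    for x, px in enumerate(p_x):
        # Predicting 1 is wrong with probability 1 - eta(x); predicting 0 with eta(x).
        total += px * ((1 - eta[x]) if f(x) == 1 else eta[x])
    return total


def expected_loss(f):
    """l(f*, f) = P(f(X) != Y) - P(f*(X) != Y), as in (3)."""
    return misclassification_rate(f) - misclassification_rate(bayes_classifier)


def always_one(x):
    """A deliberately poor classifier used for comparison."""
    return 1


bayes_error = misclassification_rate(bayes_classifier)
loss_always_one = expected_loss(always_one)
```

On this example the expected loss of any classifier is nonnegative and vanishes for the
Bayes classifier, which is the sense in which $f^*$ is optimal.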

Numerous works have dealt with the issue of predicting a label from an input $x \in \mathcal{X}$ via
the construction of a classifier (see for example [1], [38], [H], [3l], [18]). There is a large 
collection of methods coming both from computational and statistical areas and based on 
learning a classifier from a learning sample, where the inputs and labels are known. For 
a non exhaustive yet extensive bibliography on this subject, we refer to Boucheron et al. 
[6]. We based our computation of risk bounds for the CART classifier on recent results 
(see for instance [25], [55], [36], [29], [MlllI], [28], [22], [l9]). They stem from Vapnik's 
results (see [37], [23] for example), which show that, without any assumption on the joint
distribution $P$, the penalty term used in the model selection procedure is proportional to
the square root of the number of leaves over $N$. Nevertheless, it has also been shown that,
under the overoptimistic zero-error assumption (that is $Y = \eta(X)$ almost surely, where
$\eta$ is defined by (2)), this penalty term is proportional to the number of leaves over $N$.
In fact, these two extreme cases can be interpolated by so-called margin assumptions, which
make it possible to compare the loss of a classifier with its distance to the Bayes
classifier $f^*$. Numerous margin assumptions have been investigated by the above-cited
authors; some yield penalty terms proportional to the number of leaves over $N$ to the power
$\kappa$, with $1/2 \leq \kappa \leq 1$ (see for example [36] and [29]). Hence these margin
assumptions make a link between the "global" pessimistic case (without any assumption on
$P$) and the zero-error case. More recent works (see [20], [21], [2] for instance) deal with
data-driven penalties based on local Rademacher complexities and use more general margin
assumptions than those proposed in [25] and [29]. Those works also show that the margin
assumption necessary to obtain a penalty term proportional to the number of leaves over
$N$ is one of the strongest. Let us introduce the following margin assumption:



MA(1)   $\exists h \in ]0;1[,\ \forall f : \mathcal{X} \to \{0;1\},\quad l(f^*, f) \geq h\, \mathbb{E}\left[(f(X) - f^*(X))^2\right],$

where $l$ is the expected loss defined by (3). Margin assumption MA(1) is implied by the
more intuitive assumption proposed by Massart et al. in [29] (see also the slightly weaker
condition proposed in [19]):

MA(2)   $\exists h \in ]0;1[,\quad P\left(|2\eta(X) - 1| \leq h\right) = 0.$

Indeed, $l(f^*, f) = \mathbb{E}\left[|2\eta(X) - 1|\, \mathbb{1}_{f(X) \neq f^*(X)}\right]$, so
that under MA(2) one has $l(f^*, f) \geq h\, \mathbb{E}\left[(f(X) - f^*(X))^2\right]$.
Assumption MA(2) means that $(X, Y)$ is sufficiently well distributed to ensure that there
is no region in $\mathcal{X}$ for which the toss-up strategy could be favored over others: $h$
can be viewed as a measurement of the gap between labels 0 and 1 in the sense that, if
$\eta(x)$ is too close to 1/2, then choosing 0 or 1 will not make a real difference for that
$x$. Below, we prove that, under MA(1), the penalty used by CART in the pruning step is
convenient.
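
The implication from MA(2) to MA(1) can be verified exhaustively on a finite covariate
space. The following sketch is an illustration only: the distribution (`p_x`, `eta`) is
hypothetical, and the check simply enumerates all classifiers on three points.

```python
from itertools import product

# Hypothetical toy distribution on three covariate values (not from the paper).
p_x = [0.5, 0.3, 0.2]   # assumed marginal of X
eta = [0.9, 0.4, 0.7]   # assumed eta(x) = P[Y = 1 | X = x]

f_star = [1 if e > 0.5 else 0 for e in eta]  # Bayes classifier, as in (1)


def error(f):
    """Exact misclassification rate of the classifier f (a tuple of labels)."""
    return sum(px * ((1 - e) if fx == 1 else e) for px, e, fx in zip(p_x, eta, f))


def ma2_holds(h):
    """MA(2): |2 eta(X) - 1| <= h has probability zero."""
    return all(abs(2 * e - 1) > h for e in eta)


def ma1_holds(h):
    """MA(1): l(f*, f) >= h E[(f(X) - f*(X))^2] for every classifier f."""
    for f in product([0, 1], repeat=len(p_x)):
        loss = error(f) - error(f_star)
        dist = sum(px * (fx - fs) ** 2 for px, fx, fs in zip(p_x, f, f_star))
        if loss < h * dist - 1e-12:   # small tolerance for rounding
            return False
    return True


small_margin_ok = ma2_holds(0.19) and ma1_holds(0.19)
large_margin_fails = (not ma2_holds(0.3)) and (not ma1_holds(0.3))
```

Here the smallest value of $|2\eta(x) - 1|$ is 0.2, so any $h$ below 0.2 satisfies both
assumptions, while $h = 0.3$ violates both on this toy example.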

In the rest of the paper, the constant h will denote the so-called margin. Of course margin 
assumption MA(1) is chosen for its relevance in the particular framework of CART and 
shall be adapted, or simply ignored, depending on the problem under study. 

As mentioned above, we leave aside the construction of the first large tree. Thus, all 
our upper bounds for the risk of the classifier obtained by CART are considered condition- 
ally on the recursive construction of the first large tree, called maximal tree. Moreover, 
we focus on non-asymptotic bounds. 

We also leave aside the problem of consistency of CART. CART is known to be inconsistent
in many cases. Some results and conditions to obtain consistency can be found in
the paper by Devroye et al. [11]. Section 3 briefly presents consistency results for CART
based on the risk bounds obtained.

We focus on two methods that use a test sample: let us split $\mathcal{L}$ into three
independent subsamples $\mathcal{L}_1$, $\mathcal{L}_2$ and $\mathcal{L}_3$, containing
respectively $n_1$, $n_2$ and $n_3$ observations, with $n_1 + n_2 + n_3 = N$.
$\mathcal{L}_1$, $\mathcal{L}_2$ and $\mathcal{L}_3$ are taken randomly in $\mathcal{L}$, except
if the design is fixed. In that case one takes, for example, one observation out of three
to obtain each subsample. Given these three subsamples, suppose that either a large tree is
constructed using $\mathcal{L}_1$ and then pruned using $\mathcal{L}_2$ (as done in Gelfand et
al. [14]), or a large tree is constructed and pruned using the subsample
$\mathcal{L}_1 \cup \mathcal{L}_2$ (as done in [9]).
Then the final step used in both cases is to choose a subtree among the sequence by making
$\mathcal{L}_3$ go down each tree of the sequence and selecting the tree having the minimum
empirical misclassification rate: for $k = 1, 2, 3$ and for $f$ a classifier, the empirical
misclassification rate of $f$ on $\mathcal{L}_k$ is given by:

$$\gamma_{n_k}(f) = \frac{1}{n_k} \sum_{(X_i, Y_i) \in \mathcal{L}_k} \mathbb{1}_{f(X_i) \neq Y_i}. \tag{5}$$

The final estimator $\tilde f$ of $f^*$ is defined by:

$$\tilde f = \operatorname*{argmin}_{\{\hat f_{T_k}\,;\ 1 \leq k \leq K\}} \gamma_{n_3}(\hat f_{T_k}), \tag{6}$$

where $\hat f_{T_k}$ is the piecewise binary estimator of $f^*$ defined on the leaves of the
tree $T_k$ and $K$ is the number of trees appearing in the sequence.
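
The final selection step amounts to an empirical risk minimization over a finite list of
candidates. A minimal sketch is given below, assuming the candidate classifiers are
available as plain Python functions; the three threshold rules and the test sample are
hypothetical and only stand in for the classifiers attached to the pruned trees.

```python
def empirical_misclassification(f, sample):
    """Fraction of pairs (x, y) in the sample with f(x) != y."""
    return sum(1 for x, y in sample if f(x) != y) / len(sample)


def select_by_test_sample(candidates, test_sample):
    """Return the index of the candidate with minimal test error, and all test errors."""
    rates = [empirical_misclassification(f, test_sample) for f in candidates]
    best = min(range(len(candidates)), key=lambda k: rates[k])
    return best, rates


# Hypothetical candidates standing in for the classifiers of the pruned sequence.
candidates = [
    lambda x: 0,                      # root-only tree: constant label
    lambda x: 1 if x > 0.5 else 0,    # one split at 0.5
    lambda x: 1 if x > 0.2 else 0,    # one split at 0.2
]
# Hypothetical test sample playing the role of the third subsample.
test_sample = [(0.1, 0), (0.3, 0), (0.4, 0), (0.6, 1), (0.8, 1), (0.9, 1)]

best_index, rates = select_by_test_sample(candidates, test_sample)
```

Ties are broken here in favor of the first minimizer in the list, which is a convention of
this sketch and not something specified in the text.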

The paper is organized as follows. Section 2 gives an overview of the CART algorithm,
and introduces the methods and notations used in the following sections. Section 3
presents the main theoretical results for classification trees: Theorem 1 bears on the
whole algorithm, while Propositions 1 and 2 concern the pruning procedure and Proposition 3
concerns the final step. Section 4 offers prospects about the margin effect on
classification trees. Proofs are gathered in Section 5.

2 The CART Procedure

Let us give a short account of the CART procedure in the classification case and recall 
the results associated with it, which are fully explained in [9]. 

CART is based on a recursive partitioning using a training sample $\mathcal{L}$ of the random
variable $(X, Y) \in \mathcal{X} \times \{0, 1\}$ ($\mathcal{L} = \mathcal{L}_1$ or
$\mathcal{L} = \mathcal{L}_1 \cup \mathcal{L}_2$), and a class $\mathcal{S}$ of subsets of
$\mathcal{X}$ which tells us how to split at each step. For instance, if
$\mathcal{X} = \mathbb{R}^d$, $\mathcal{S}$ is usually taken as some class of half-spaces of
$\mathcal{X}$, for example the half-spaces of $\mathcal{X}$ with frontiers parallel to the axes
(see for example [9], [12]). Below, we consider a class $\mathcal{S}$ with finite
Vapnik-Chervonenkis dimension, henceforth referred to as VC-dimension (for a complete
overview of the VC-dimension see [37]).

The procedure is computed in two steps, called the growing procedure and the pruning
procedure. The growing procedure constructs a maximal binary tree $T_{\max}$ from the data
by recursive partitioning, and then the pruning procedure selects, among all the subtrees
of $T_{\max}$, a sequence that contains the entire statistical information.

2.1 Growing and pruning procedures 

2.1.1 Growing Procedure 

Since our main interest in this paper is the pruning procedure, we only present a brief
overview of the growing procedure (for more details about the growing procedure, see [9]).
The growing procedure is based on a recursive binary partitioning of $\mathcal{X}$. Let us
start with the first step: $\mathcal{X}$ is split into two parts by minimizing some empirical
convex function on $\mathcal{S}$. A strictly convex function is used in order to avoid ties,
which occur systematically when using the simplest empirical misclassification rate (see
[9], [21]). Thus this function is chosen in such a way that the data are split into two
groups where the labels of the data in each group are as similar as possible. It implies
that the empirical misclassification rate in each subgroup is largely reduced. Note that
the sum of empirical misclassification rates of each subgroup (called node) is always
smaller than the global empirical misclassification rate on the sample $\mathcal{L}$ (called
the root $t_1$ of the tree). In the tree terminology, one adds to the root $t_1$ a left node
$t_L$ and a right node $t_R$. In what follows, we always identify a tree node with its
corresponding subset in $\mathcal{S}$. Finally, a label is given to each node by majority vote
(which corresponds to minimizing the empirical misclassification rate in each node).

Then the same elementary step is applied recursively to the two generated subsamples
$\{(X_i, Y_i)\,;\ X_i \in t_L\}$ and $\{(X_i, Y_i)\,;\ X_i \in t_R\}$ until some convenient
stopping condition is satisfied. This generates the maximal tree $T_{\max}$; one calls
terminal nodes or leaves the final nodes of $T_{\max}$.
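
One elementary growing step can be sketched as follows. This is an illustration under
assumptions that are not fixed by the text: the impurity is taken to be the Gini index
(one common choice in CART), the splits are axis-parallel thresholds placed at midpoints
between consecutive observed values, and the data are made up.

```python
def gini(labels):
    """Gini impurity 2 p (1 - p) of a list of 0/1 labels (0 for an empty list)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(points, labels):
    """Search axis-parallel splits x_j < c and return (score, j, c) with minimal
    weighted impurity of the two children."""
    n = len(points)
    best = None
    for j in range(len(points[0])):
        values = sorted(set(p[j] for p in points))
        for low, high in zip(values, values[1:]):
            c = (low + high) / 2   # midpoint threshold (a convention of this sketch)
            left = [y for p, y in zip(points, labels) if p[j] < c]
            right = [y for p, y in zip(points, labels) if p[j] >= c]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, j, c)
    return best


# Hypothetical sample in [0, 1]^2: the label only depends on the first coordinate.
points = [(0.1, 0.9), (0.2, 0.2), (0.3, 0.7), (0.7, 0.8), (0.8, 0.1), (0.9, 0.5)]
labels = [0, 0, 0, 1, 1, 1]
score, coordinate, threshold = best_split(points, labels)
```

Applying `best_split` recursively to the two resulting subsamples, until a stopping rule is
met, would produce a maximal tree in the sense described above.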

2.1.2 Pruning Procedure 

Recall that a pruned subtree of $T_{\max}$ is defined as any binary subtree of $T_{\max}$
having the same root $t_1$ as $T_{\max}$.

Now, let us introduce some notations:

(i) Take two trees $T_1$ and $T_2$. Then, if $T_1$ is a pruned subtree of $T_2$, write $T_1 \preceq T_2$.

(ii) For a tree $T$, $\widetilde{T}$ denotes the set of its leaves and $|\widetilde{T}|$ the cardinality of $\widetilde{T}$.

To prune $T_{\max}$, one proceeds as follows. First simply denote by $n$ the number of data
used. Notice that, given a tree $T$ and the set $\mathcal{F}_T$ of binary piecewise constant
functions in $L^2(\mathcal{X})$ defined on the partition given by the leaves of $T$, one has

$$\hat f_T = \operatorname*{argmin}_{f \in \mathcal{F}_T} \gamma_n(f) = \sum_{t \in \widetilde{T}} \operatorname*{argmax}_{y \in \{0,1\}} \left|\{i\,;\ Y_i = y,\ X_i \in t\}\right| \mathbb{1}_t,$$

where $\gamma_n$ is the empirical misclassification rate defined by (5) and
$\mathbb{1}_t(x) = 1$ if $x$ falls in the leaf $t$, $\mathbb{1}_t(x) = 0$ otherwise.

Then, given $T \preceq T_{\max}$ and $\alpha > 0$, one defines

$$\mathrm{crit}_\alpha(T) = \gamma_n(\hat f_T) + \alpha \frac{|\widetilde{T}|}{n} \tag{7}$$

the penalized criterion for the so-called temperature $\alpha$, and $T_\alpha$ the subtree of
$T_{\max}$ satisfying:

(i) $T_\alpha = \operatorname*{argmin}_{T \preceq T_{\max}} \mathrm{crit}_\alpha(T)$,

(ii) if $\mathrm{crit}_\alpha(T) = \mathrm{crit}_\alpha(T_\alpha)$, then $T_\alpha \preceq T$.

Thus $T_\alpha$ is the smallest minimizing subtree for the temperature $\alpha$. The existence
and the uniqueness of $T_\alpha$ are proved in [9, pp. 284-290].

The aim of the pruning procedure is to raise the temperature $\alpha$ and to record the
corresponding $T_\alpha$. The algorithm is iterative: it consists in minimizing a function of
the nodes at each step, which leads to a finite decreasing sequence of subtrees pruned
from $T_{\max}$

$$T_{\max} \succeq T_1 \succ \ldots \succ T_{K-1} \succ T_K = \{t_1\}$$

corresponding to a finite increasing sequence of temperatures

$$0 = \alpha_1 < \alpha_2 < \ldots < \alpha_{K-1} < \alpha_K,$$

where $t_1$ corresponds to the root of $T_{\max}$ as defined in the growing procedure.

Remark 1. $T_1$ is the smallest subtree for temperature 0, so it is not necessarily equal to
$T_{\max}$.

Breiman, Friedman, Olshen and Stone's Theorem [9] justifies this algorithm:
Theorem 2.1.1 (Breiman, Friedman, Olshen, Stone).

The sequence $(\alpha_k)_{1 \leq k \leq K}$ is nondecreasing, the sequence
$(T_k)_{1 \leq k \leq K}$ is nonincreasing and, given $k \in \{1, \ldots, K\}$, if
$\beta \in [\alpha_k, \alpha_{k+1}[$, then $T_\beta = T_{\alpha_k} = T_k$.



This theorem allows us to check that, for any $\alpha > 0$, $T_\alpha$ belongs to the sequence
$(T_k)_{1 \leq k \leq K}$. This algorithm significantly reduces the complexity of the choice
of a subtree pruned from $T_{\max}$, since by Theorem 2.1.1 the sequence of pruned subtrees
contains the whole statistical information according to the choice of the penalty function
used in (7). Consequently it is useless to look at all the subtrees. Notice that the form
of the penalized criterion is essential to obtain Theorem 2.1.1. Hence, to fully validate
this algorithm, we need to show that the choice of penalty is relevant.
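
The role of the linear penalty can be observed on a very small tree by brute force. The
sketch below is not the iterative pruning algorithm of [9]: it simply enumerates all pruned
subtrees of a hypothetical seven-node maximal tree, with made-up error counts, and records
the smallest minimizer of criterion (7) along a grid of temperatures.

```python
# Hypothetical maximal tree: node -> (left child, right child), None for a leaf.
children = {1: (2, 3), 2: (4, 5), 3: (6, 7), 4: None, 5: None, 6: None, 7: None}
# Made-up numbers of training points misclassified in each node under majority vote.
errors = {1: 40, 2: 10, 3: 12, 4: 2, 5: 3, 6: 6, 7: 5}
n = 100  # assumed number of observations used for pruning


def pruned_subtrees(node):
    """All pruned subtrees rooted at `node`, each encoded by its frozenset of leaves."""
    result = [frozenset([node])]          # cut everything below `node`
    if children[node] is not None:
        left, right = children[node]
        for a in pruned_subtrees(left):
            for b in pruned_subtrees(right):
                result.append(a | b)
    return result


def crit(leaves, alpha):
    """Criterion (7): empirical error plus alpha times the number of leaves over n."""
    return (sum(errors[t] for t in leaves) + alpha * len(leaves)) / n


def smallest_minimizer(alpha):
    """Minimizer of crit over all pruned subtrees, with the fewest leaves among ties."""
    trees = pruned_subtrees(1)
    best = min(crit(t, alpha) for t in trees)
    ties = [t for t in trees if abs(crit(t, alpha) - best) < 1e-12]
    return min(ties, key=len)


sequence = []
for alpha in [0, 1, 2, 4, 6, 8, 10, 12, 20]:
    tree = smallest_minimizer(alpha)
    if not sequence or sequence[-1] != tree:
        sequence.append(tree)
```

On this example only four of the five pruned subtrees are ever selected, and they are
nested, which is the behaviour described by Theorem 2.1.1. For realistic tree sizes the
enumeration is infeasible, which is why the iterative algorithm matters.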

The final step is to choose a suitable temperature $\alpha$. Instead of minimizing over
$\alpha$, this issue is dealt with by using a test sample to provide the final estimator
$\tilde f$, as mentioned in the Introduction, via equality (6). The results given in
Sections 3.1 and 3.2 deal with the performance of the piecewise constant estimators given
by $T_\alpha$ for $\alpha$ fixed, and with the performance of $\tilde f$ respectively.

Before focusing on risk bounds, let us present the methods and notations used to obtain 
these bounds. 

2.2 Methods and Notations 

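For a given tree $T$, $\mathcal{F}_T$ will denote the set of classifiers defined on the
partition given by the leaves of $T$, that is

$$\mathcal{F}_T = \left\{ \sum_{t \in \widetilde{T}} a_t \mathbb{1}_t\ ;\ (a_t)_{t \in \widetilde{T}} \in \{0, 1\}^{|\widetilde{T}|} \right\},$$
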
where $\widetilde{T}$ refers to the set of the leaves of $T$. Thus $\hat f_T$ is the empirical
risk minimizer on $\mathcal{F}_T$. For any tree-structured estimator $\tilde f$ of $f^*$,
$\tilde f$ is said to satisfy an oracle inequality if there exists some nonnegative constant
$C$ such that



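$$R_{\mathcal{L}_1}(\tilde f, f^*) \leq C \inf_{T \preceq T_{\max}} R_{\mathcal{L}_1}(\hat f_T, f^*), \tag{8}$$

where $R_{\mathcal{L}_1}(\cdot, f^*) = \mathbb{E}[l(f^*, \cdot) \mid \mathcal{L}_1]$ and
$\mathbb{E}[\cdot \mid \mathcal{L}_1]$ denotes the conditional expectation given the subsample
$\mathcal{L}_1$.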

To estimate $f^*$ using the CART algorithm and to compare the performance of $\tilde f$ with
those of each $\hat f_T$, two different methods can be applied:

M1: $\mathcal{L}$ is split into three independent parts $\mathcal{L}_1$, $\mathcal{L}_2$ and
$\mathcal{L}_3$ containing respectively $n_1$, $n_2$ and $n_3$ observations, with
$n_1 + n_2 + n_3 = N$. Hence $T_{\max}$ is constructed using $\mathcal{L}_1$, then pruned
using $\mathcal{L}_2$ and finally the best subtree $\hat T$ is selected among the sequence of
pruned subtrees thanks to $\mathcal{L}_3$, and we define $\tilde f = \hat f_{\hat T}$.

M2: $\mathcal{L}$ is split into two independent parts $\mathcal{L}_1$ and $\mathcal{L}_3$
containing respectively $n_1$ and $n_3$ observations, with $n_1 + n_3 = N$. Hence $T_{\max}$
is constructed and pruned using $\mathcal{L}_1$ and finally the best subtree $\hat T$ is
selected among the sequence of pruned subtrees thanks to $\mathcal{L}_3$, and we define
$\tilde f = \hat f_{\hat T}$.

Note that a penalty is needed in both methods in order to reduce the number of candidate
tree-structured models contained in $T_{\max}$. Indeed, if one does not penalize, the number
of models to be considered grows exponentially with $N$ (see [9]). So making a selection
by using a test sample without penalizing requires visiting all the models. In that case,
looking for the best model in the collection of all subtrees pruned from the maximal one
becomes computationally intractable. Hence penalizing significantly reduces the number of
trees taken into account, while still providing a convenient risk for $\tilde f$.
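
To give an idea of how quickly the collection of pruned subtrees grows, the following
sketch counts them for a complete binary tree of a given depth. The complete-tree shape is
an assumption made for the illustration; the recursion only uses the fact that a pruned
subtree either cuts at the root or combines a pruned subtree of each child.

```python
def count_pruned_subtrees(depth):
    """Number of pruned subtrees of a complete binary tree of the given depth.

    A tree of depth 0 is a single leaf (one pruned subtree: itself). Otherwise one
    either keeps only the root, or picks independently a pruned subtree of the left
    and of the right child, hence N(d) = 1 + N(d - 1) ** 2.
    """
    count = 1
    for _ in range(depth):
        count = 1 + count * count
    return count


counts = [count_pruned_subtrees(d) for d in range(6)]
```

A complete tree of depth 5 has only 32 leaves but already several hundred thousand pruned
subtrees, whereas the pruned sequence contains at most as many trees as there are leaves.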



3 Risk Bounds 



This section is devoted to the results obtained on the performance of the CART classifiers
for both methods M1 and M2. We shall first present a general theorem, then give more
precise results on the last two parts of the algorithm, which are the pruning procedure
and the final selection by test sample.

Assume that the following margin assumption is fulfilled: there exists some absolute
constant $h \in ]0;1[$ such that, for every classifier $f$,

MA(1a)   $l(f^*, f) \geq h\, \mathbb{E}\left[(f(X) - f^*(X))^2\right]$   if $\tilde f$ is constructed via M1,

MA(1b)   $l_{n_1}(f^*, f) \geq h\, \frac{1}{n_1} \sum_{X_i \in X_1^{n_1}} (f(X_i) - f^*(X_i))^2$   if $\tilde f$ is constructed via M2,

where $l$ is the expected loss (3) defined in Section 1,
$X_1^{n_1} = \{X_i\,;\ (X_i, Y_i) \in \mathcal{L}_1\}$ and $l_{n_1}(f^*, f)$ is the empirical
expected loss conditionally on the grid $X_1^{n_1}$ defined by

$$l_{n_1}(f^*, f) = \mathbb{E}_Y\left[\gamma_{n_1}(f) - \gamma_{n_1}(f^*)\right], \tag{9}$$

with $\mathbb{E}_Y$ the expectation with respect to the marginal distribution of $Y$.



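Theorem 1. Given $N$ independent pairs of variables $((X_i, Y_i))_{1 \leq i \leq N}$ of common
distribution $P$, with $(X_i, Y_i) \in \mathcal{X} \times \{0, 1\}$, let us consider the
estimator $\tilde f$ (6) of the Bayes classifier $f^*$ obtained via the CART procedure as
defined in Section 2. Then we have the following results.

(i) if $\tilde f$ is constructed via M1:

Let $l(f^*, \tilde f)$ be the expected loss (3) of $\tilde f$ and $h$ be the margin given by
MA(1a). Then there exist some absolute constants $C$, $C_1$ and $C_2$ such that

$$\mathbb{E}\left[l(f^*, \tilde f) \mid \mathcal{L}_1\right] \leq C \inf_{T \preceq T_{\max}} \left\{ \inf_{f \in \mathcal{F}_T} \mathbb{E}\left[l(f^*, f) \mid \mathcal{L}_1\right] + h^{-1} \frac{|\widetilde{T}|}{n_2} \right\} + h^{-1} \frac{C_1}{n_2} \tag{10}$$

$$\qquad + h^{-1} C_2 \frac{\log n_1}{n_3}. \tag{11}$$

(ii) if $\tilde f$ is constructed via M2:

Let $P_{\mathcal{L}_1}$ be the product distribution on $\mathcal{L}_1$, let
$l_{n_1}(f^*, \tilde f)$ be the empirical expected loss of $\tilde f$ conditionally on the
grid $X_1^{n_1}$ and $h$ be the margin given by MA(1b). Let $V$ be the Vapnik-Chervonenkis
dimension of the set of splits used to construct $T_{\max}$ and suppose that $n_1 \geq V$.
Let $K$ be the number of pruned subtrees of the sequence provided by the pruning procedure.
Then there exist some absolute constants $C'$, $C_1'$, $C_1''$ and $C_2$ such that, for every
$\delta \in ]0;1[$, on a set $\Omega_\delta$ verifying
$P_{\mathcal{L}_1}(\Omega_\delta) \geq 1 - \delta$,

$$\mathbb{E}\left[l_{n_1}(f^*, \tilde f) \mid \mathcal{L}_1\right] \leq C' \inf_{T \preceq T_{\max}} \left\{ \inf_{f \in \mathcal{F}_T} l_{n_1}(f^*, f) + h^{-1} \log\left(\frac{n_1}{V}\right) \frac{|\widetilde{T}|}{n_1} \right\} + h^{-1} \frac{C_\delta}{n_1} \tag{12}$$

$$\qquad + h^{-1} C_2 \frac{\log K}{n_3}, \tag{13}$$

with $C_\delta = C_1' + C_1'' \log(1/\delta)$.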

Note that the constants appearing in the upper bounds for the risks are not sharp. We do 
not investigate the sharpness of the constants here. 



Let us comment on the results given in Theorem 1.

1) Both methods M1 and M2 are considered for the following reasons:

• Since all the risks are considered conditionally on the growing procedure, the M1
method allows a deterministic penalized model selection and hence yields sharper
upper bounds than the M2 method.

• On the other hand, the M2 method keeps the whole information given by
$\mathcal{L}_1$. Indeed, in that case, the sequence of pruned subtrees is not obtained via
some plug-in method using a first split of the sample to provide the collection of
tree-structured models. This method is the one proposed by Breiman et al. and it
is more commonly applied in practice than the former. We focus on this method
to ensure that it provides classifiers that have good performance in terms of risk.

2) For both the M1 and M2 methods, the inequality of Theorem 1 can be separated into
two parts:

• (10) and (12) correspond to the pruning procedure. They show that, up to some
absolute constant and the final selection, the conditional risk of the final classifier
is approximately of the same order as the infimum of the penalized risks of the
collection of subtrees of $T_{\max}$. The term inside the infimum is of the same form
as the penalized criterion (7) used in the pruning procedure. This shows that,
for a sufficiently large temperature $\alpha$, this criterion selects convenient
subtrees in terms of conditional risk. Let us emphasize that the penalty term
is directly proportional to the number of leaves in the M1 method, whereas a
multiplicative logarithmic term appears in the M2 method. This term is due to
the randomness of the models considered, since the samples used to construct and
prune $T_{\max}$ are no longer independent.

• (11) and (13) correspond to the final selection of $\tilde f$ among the collection of
pruned tree-structured classifiers using $\mathcal{L}_3$. As $K \leq n_1$, this selection
adds a term proportional to $\log n_1 / n_3$ for both methods, which shows that not much
is lost when a test sample is used provided that $n_3$ is sufficiently large with respect
to $\log n_1$. Nevertheless, since we have no idea of the size of the constant $C_2$, it is
difficult to deduce a general way of choosing $\mathcal{L}_3$ from this upper bound.

3) Let us comment on the role of the Vapnik-Chervonenkis dimension of the set of splits
$\mathcal{S}$ used to construct $T_{\max}$. Let us take the most frequently used case in CART,
where $\mathcal{S}$ is the set of all half-spaces of $\mathcal{X} = \mathbb{R}^d$. In this
particular case, we have $V = d + 1$. So, if $\mathcal{X}$ is low dimensional, the
$\log n_1$ term has to be taken into account in the risk bound.
Nevertheless, if CART provides models such that

- the maximal dimension of the models is $o(N / \log N)$,

- the approximation properties of the models are convenient enough to ensure that
the bias tends to zero with increasing sample size $N$,

then we have a result of consistency for $\tilde f$ if $n_3$ is conveniently chosen with
respect to $\log n_1$.

4) Let us emphasize the role of the margin in the quality of the selected classifier.
Theorem 1 shows that the higher the margin, the smaller the risk, which is intuitive since
the more separable the labels are, the easier the classification shall be. This confirms
the fact that CART does a convenient job if margin assumption MA(1a) or MA(1b)
is fulfilled.

Furthermore, let us comment briefly on the size of the margin to obtain oracle-type
inequalities in Theorem 1. Massart et al. [29] show that, if $h \leq \sqrt{|\widetilde{T}|/n}$
for one model (where $n = n_2$ for M1 and $n = n_1$ for M2), then the upper bound for the
risk on this model (and then the penalty term in our framework) is of order
$\sqrt{|\widetilde{T}|/n}$. They obtain this result via minimax bounds for the risk that make
a connection between the zero error case (corresponding to $h = 1/2$), with a minimax risk
of order $|\widetilde{T}|/n$, and the "global" pessimistic case (corresponding to $h = 0$),
with a minimax risk of order $\sqrt{|\widetilde{T}|/n}$.



These results suggest that Theorem 1 gives oracle-type inequalities only if
$h > \sqrt{|\widetilde{T}|/n}$ for every tree $T$ pruned from $T_{\max}$. Let us recall that
the pruning procedure and consequently the results of Theorem 2.1.1 heavily depend on the
linearity of the penalized criterion (7). It is not clear whether these results remain
valid when using a non-linear penalty function, so we need to keep a penalty term of order
$|\widetilde{T}|/n$ to ensure that the sequence of pruned subtrees contains the whole
statistical information. Hence CART will underpenalize trees for which
$h \leq \sqrt{|\widetilde{T}|/n}$, since in that case the penalty term should be of order
$\sqrt{|\widetilde{T}|/n} > |\widetilde{T}|/n$. Due to the recursiveness of the pruning
algorithm, if the above mentioned case occurs, then CART may select classifiers having an
excessive number of leaves.

Nevertheless, the condition on the size of the margin can be forced via the growing
procedure. Indeed, if the condition $h > \sqrt{|\widetilde{T}_{\max}|/n}$ is fulfilled, then
the penalty is optimal in terms of risk. This condition can be controlled during the
growing procedure, for example by forcing the construction of the maximal tree to stop
earlier. This is obviously difficult to do in practice since it heavily depends on the data
and on the size of the learning sample, and is worth investigating more deeply (ongoing
work).
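
As a rough numerical reading of the condition above, the following sketch computes the
largest number of leaves compatible with $h > \sqrt{|\widetilde{T}|/n}$, that is
$|\widetilde{T}| < h^2 n$. The function name and the numerical values are hypothetical, and
in practice $h$ is unknown, so this only illustrates the order of magnitude involved.

```python
import math


def max_leaves_for_margin(h, n):
    """Largest integer number of leaves such that h > sqrt(leaves / n).

    Starts from the integer part of h^2 n and steps down while the strict
    inequality fails, which also protects against floating-point rounding.
    """
    leaves = int(h * h * n)
    while leaves > 0 and not h > math.sqrt(leaves / n):
        leaves -= 1
    return leaves


leaves_large_margin = max_leaves_for_margin(0.25, 1000)   # h^2 n = 62.5
leaves_small_margin = max_leaves_for_margin(0.05, 200)    # h^2 n = 0.5
```

With a small margin and a small sample, no tree with at least one leaf satisfies the
condition, which reflects the difficulty mentioned in the text.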



The two following subsections give more precise results on the pruning algorithm for both
the M1 and M2 methods, and particularly on the constants appearing in the penalty
function. Subsection 3.2 validates the discrete selection by test sample. Note that the
two results obtained for the validation of the pruning algorithm also hold in the case of
deterministic $X_i$'s.



3.1 Validation of the Pruning Procedure 

In this section, we focus more particularly on the pruning algorithm and give trajectorial
risk bounds for the classifier associated with $T_\alpha$, the smallest minimizing subtree
for the temperature $\alpha$ defined in subsection 2.1.2. We show that, for a convenient
constant $\alpha$, $\hat f_{T_\alpha}$ is not far from $f^*$ in terms of its risk
conditionally on $\mathcal{L}_1$. Let us emphasize that the subsample $\mathcal{L}_3$ plays no
role in the two following results.



3.1.1 $\tilde f$ constructed via M1

Here we consider the second subsample $\mathcal{L}_2$ of $n_2$ observations. We assume that
$T_{\max}$ is constructed on the first set of observations $\mathcal{L}_1$ and then pruned
with the second set $\mathcal{L}_2$ independent of $\mathcal{L}_1$. Since the set of pruned
subtrees is deterministic according to $\mathcal{L}_2$, we make a selection among a
deterministic collection of models.

For any subtree $T$ of $T_{\max}$, let $\mathcal{F}_T$ be the model defined on the leaves of
$T$ given by

$$\mathcal{F}_T = \left\{ \sum_{t \in \widetilde{T}} a_t \mathbb{1}_t\ ;\ (a_t)_{t \in \widetilde{T}} \in \{0, 1\}^{|\widetilde{T}|} \right\}.$$

$f^*$ will then be estimated on $\mathcal{F}_T$, whose dimension is $|\widetilde{T}|$.

Then we choose the estimators as follows: let $\gamma_{n_2}$ be the empirical contrast as
defined by (5).

• For $T \preceq T_{\max}$, $\hat f_T = \operatorname*{argmin}_{f \in \mathcal{F}_T} \left[\gamma_{n_2}(f)\right]$,

• For $\alpha > 0$, $T_\alpha$ is the smallest minimizing subtree for the temperature $\alpha$
as defined in subsection 2.1.2 and
$\hat f_{T_\alpha} = \operatorname*{argmin}_{f \in \mathcal{F}_{T_\alpha}} \left[\gamma_{n_2}(f)\right]$.

Let us now consider the behaviour of $\hat f_{T_\alpha}$.

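Proposition 1. Let $P_{\mathcal{L}_2}$ be the product distribution on $\mathcal{L}_2$ and let
$h$ be the margin given by MA(1a). Let $\xi > 0$.

There exists a large enough positive constant $\alpha_0 > 2 + \log 2$ such that, if
$\alpha > \alpha_0$, then there exist some nonnegative constants $\Sigma_\alpha$ and $C$ such
that

$$l(f^*, \hat f_{T_\alpha}) \leq C_1(\alpha) \inf_{T \preceq T_{\max}} \left\{ \inf_{f \in \mathcal{F}_T} l(f^*, f) + h^{-1} \frac{|\widetilde{T}|}{n_2} \right\} + C h^{-1} \frac{\xi}{n_2}$$

on a set $\Omega_\xi$ such that $P_{\mathcal{L}_2}(\Omega_\xi) \geq 1 - \Sigma_\alpha e^{-\xi}$,
where $l$ is defined by (3), and $C_1(\alpha) > \alpha_0$ and $\Sigma_\alpha$ are increasing
with $\alpha$.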

We obtain a trajectorial non-asymptotic risk bound on a large probability set, leading to
the conclusions given for Theorem 1. Nevertheless, taking an excessive temperature $\alpha$
will overpenalize and select a classifier having high risk
$\mathbb{E}[l(f^*, \hat f_{T_\alpha}) \mid \mathcal{L}_1]$. Furthermore, the fact that
$C_1(\alpha)$ and $\Sigma_\alpha$ are increasing with $\alpha$ suggests that both sides of the
inequality grow with $\alpha$. The choice of the convenient temperature is then critical to
make a good compromise between the size of
$\mathbb{E}[l(f^*, \hat f_{T_\alpha}) \mid \mathcal{L}_1]$ and a large enough penalty term.
In practice, since this temperature depends on the unknown margin $h$ and some unknown
constants, the use of a test sample as described in Section 1 is a convenient choice, as
shown by Proposition 3.

3.1.2 $\tilde f$ constructed via M2

In this subsection we define the different contrasts, expected loss and estimators exactly
in the same way as in subsection 3.1.1, although $l$ is replaced by the empirical expected
loss on $X_1^{n_1} = \{X_i\,;\ (X_i, Y_i) \in \mathcal{L}_1\}$ defined by (9),

$$l_{n_1}(f^*, f) = \mathbb{E}_Y\left[\gamma_{n_1}(f) - \gamma_{n_1}(f^*)\right],$$

since the models and the evaluations of the empirical errors $\gamma_{n_1}(\hat f_T)$ are
computed on the same grid $X_1^{n_1}$. In this case, we obtain nearly the same performance
for $\hat f_{T_\alpha}$ despite the fact that the constant appearing in the penalty term can
now depend on $n_1$:

Proposition 2. Let $P_{\mathcal{L}_1}$ be the product distribution on $\mathcal{L}_1$,
$l_{n_1}$ (9) be the empirical expected loss computed on
$\{X_i\,;\ (X_i, Y_i) \in \mathcal{L}_1\}$, and let $h$ be the margin given by MA(1b). Let
$\xi > 0$ and

$$a_{n_1, V} = 2 + \frac{V}{2}\left(1 + \log\frac{n_1}{V}\right).$$

There exists a large enough positive constant $\alpha_0$ such that, if $\alpha > \alpha_0$,
then there exist some nonnegative constants $\Sigma_\alpha$ and $C'$ such that

$$l_{n_1}(f^*, \hat f_{T_\alpha}) \leq C_1'(\alpha) \inf_{T \preceq T_{\max}} \left\{ \inf_{f \in \mathcal{F}_T} l_{n_1}(f^*, f) + h^{-1} a_{n_1, V} \frac{|\widetilde{T}|}{n_1} \right\} + C' h^{-1} \frac{\xi}{n_1}$$

on a set $\Omega_\xi$ such that $P_{\mathcal{L}_1}(\Omega_\xi) \geq 1 - 2\Sigma_\alpha e^{-\xi}$,
where $C_1'(\alpha) > \alpha_0$ and $\Sigma_\alpha$ are increasing with $\alpha$.

We obtain a similar trajectorial non-asymptotic risk bound on a large probabilty set. The 
same conclusions as those derived from the Ml case hold in this case. Let us just mention 
that the penalty term takes into account the complexity of the collection of trees having 
fixed number of leaves which can be constructed on {Xi ; {Xi,Yi) G Since this com- 
plexity is controlled via the VC-dimension V, V necessarily appears in the penalty term. 
It differs from Proposition [J in the sense that the models we consider are random, so this 
complexity has to be taken into account to obtain an uniform bound. 

Example: Let us consider the case where $\mathcal{S}$ is the set of all half-spaces of $\mathcal{X} = \mathbb{R}^d$ (which is the most common case in the CART algorithm). In this case, $V = d + 1$; consequently, if $n_1 > d + 1$, we obtain a penalty proportional to

$$\left(\frac{4 + (d+1)\left(1 + \log\left[n_1/(d+1)\right]\right)}{2h}\right) \frac{|\tilde T|}{n_1}.$$
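The shape of this penalty can be evaluated numerically. The following is a minimal sketch, not the paper's implementation: the function names `a_factor` and `halfspace_penalty_shape` are hypothetical, and the result is only the factor the penalty is proportional to (the temperature $\alpha$ and other constants are left out).

```python
import math


def a_factor(n1: int, vc_dim: int) -> float:
    """a_{n1,V} = 2 + (V/2) * (1 + log(n1 / V)), as in Proposition 2."""
    if n1 <= vc_dim:
        raise ValueError("the example assumes n1 > V")
    return 2.0 + 0.5 * vc_dim * (1.0 + math.log(n1 / vc_dim))


def halfspace_penalty_shape(n_leaves: int, n1: int, d: int, h: float) -> float:
    """Penalty shape for half-space splits in R^d, where V = d + 1.

    Returns a_{n1,V} / h * |T| / n1; the true penalty is proportional
    to this quantity (hypothetical helper, constants omitted).
    """
    if not 0.0 < h <= 1.0:
        raise ValueError("the margin h is assumed to lie in (0, 1]")
    return a_factor(n1, d + 1) / h * n_leaves / n1


if __name__ == "__main__":
    # The shape grows linearly with the number of leaves and like 1/h.
    for leaves in (2, 4, 8):
        print(leaves, halfspace_penalty_shape(leaves, n1=1000, d=5, h=0.2))
```

The linear dependence on the number of leaves is what makes this penalty compatible with the CART pruning criterion; only the multiplicative factor changes with $n_1$, $d$ and $h$.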

So, if CART provides some minimax estimator on a class of functions, the $\log n_1$ term always appears for $f^*$ in this class when working in a linear space of low dimension.

As for the M1 case, since the temperature $\alpha$ depends on the unknown margin $h$ and some unknown constants, the use of a test sample to select the final classifier among the sequence of pruned subtrees is a convenient choice, as shown by Proposition 3.
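The selection by test sample mentioned above amounts to minimizing the empirical misclassification rate over the pruned sequence. The sketch below illustrates this under simplifying assumptions: classifiers are plain Python callables, and the names `empirical_error` and `select_by_test_sample` are hypothetical rather than taken from the paper.

```python
from typing import Any, Callable, List, Sequence, Tuple

Classifier = Callable[[Any], int]


def empirical_error(clf: Classifier, sample: Sequence[Tuple[Any, int]]) -> float:
    """Empirical misclassification rate of clf on a labelled sample."""
    if not sample:
        raise ValueError("the test sample must not be empty")
    return sum(1 for x, y in sample if clf(x) != y) / len(sample)


def select_by_test_sample(
    candidates: Sequence[Classifier],
    test_sample: Sequence[Tuple[Any, int]],
) -> Tuple[int, List[float]]:
    """Return the index of the candidate with the smallest test error.

    candidates plays the role of the classifiers attached to the pruned
    subtrees; test_sample plays the role of the held-out sample.
    Ties are broken in favour of the first candidate.
    """
    errors = [empirical_error(clf, test_sample) for clf in candidates]
    best = min(range(len(candidates)), key=errors.__getitem__)
    return best, errors


if __name__ == "__main__":
    trees = [lambda x: 0, lambda x: int(x > 0.5), lambda x: 1]
    held_out = [(0.1, 0), (0.2, 0), (0.7, 1), (0.9, 1), (0.6, 0)]
    print(select_by_test_sample(trees, held_out))
```

No knowledge of the margin $h$ or of the temperature is needed here, which is the practical appeal of this final step.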



3.2 Final Selection 

We focus here on the final step of the CART procedure: the selection of the classifier $\tilde f$ among the collection of pruned subtrees given by the pruning procedure by using a test sample $\mathcal{L}_3$. Given the sequence $(T_k)_{1 \le k \le K}$ pruned from $T_{max}$ as defined in subsection 3.1, let us recall that $\tilde f$ is defined by

$$\tilde f = \operatorname{argmin}_{\{\hat f_{T_k} ;\ 1 \le k \le K\}} \gamma_{n_3}(\hat f_{T_k}).$$

The performance of this classifier can be compared to the performance of the collection of classifiers $(\hat f_{T_k})_{1 \le k \le K}$ by the following:

Proposition 3.

(i) if $\tilde f$ is constructed via M1, let $\lambda = l$ and $R_{n_3}(f^*, \tilde f) = \mathbb{E}\left[\lambda(f^*, \tilde f) \mid \mathcal{L}_1, \mathcal{L}_2\right]$;

(ii) if $\tilde f$ is constructed via M2, let $\lambda = l_{n_1}$ and $R_{n_3}(f^*, \tilde f) = \mathbb{E}\left[\lambda(f^*, \tilde f) \mid \mathcal{L}_1\right]$, where $l_{n_1}$ is the empirical expected loss of subsection 3.1.2.

For both cases, there exist three absolute constants $C'' > 1$, $C_1' > 3/2$ and $C_2' > 3/2$ such that

$$R_{n_3}(f^*, \tilde f) \le C'' \inf_{1 \le k \le K} \lambda(f^*, \hat f_{T_k}) + C_1' h^{-1} \frac{\log K}{n_3} + h^{-1} \frac{C_2'}{n_3},$$

where $K$ is the number of pruned subtrees extracted during the pruning procedure.

4 Concluding Remarks

We have proven that CART provides convenient classifiers in terms of conditional risk 
under a margin condition. Nevertheless, as for the regression case, the properties of the 
growing procedure need to be analyzed to obtain full unconditional upper bounds. 
The remarks made after Theorem [1] on the size of the margin h give some prospects for 
the application of CART in practice. These prospects may be for example 

• using the slope heuristic (see for example ^ ^) to select a classifier among a collection,

• searching for a robust manner to determine if the margin assumption is fulfilled, permitting to use the blind selection by test sample.

One way to estimate the margin $h$ if assumption MA(1a) or MA(1b) is fulfilled could be to use mixing procedures such as boosting (see [8] [ISj for example). This estimate could then be used in the penalized criterion to help find the convenient temperature. It could also give an idea of the difficulty of classifying the considered data and hence help choose the most suitable classification method.

5 Proofs

Let us start with a preliminary result.

5.1 Local Bound for Tree-Structured Classifiers

Let $(X, Y) \in \mathcal{X} \times \{0; 1\}$ be a pair of random variables and $\{(X_1, Y_1), \ldots, (X_n, Y_n)\}$ be $n$ independent copies of $(X, Y)$. Let $\|.\|_n$ denote the empirical norm on $X_1^n = (X_i)_{1 \le i \le n}$. Then given two classifiers $f$ and $g$, let us define

$$d_n^2(f, g) = \frac{1}{n} \sum_{i=1}^n \left(f(X_i) - g(X_i)\right)^2 := \|f - g\|_n^2.$$

Let $\mathcal{M}_n^*$ be the set of all possible tree-structured partitions that can be constructed on the grid $X_1^n$, corresponding to trees having all possible splits in $\mathcal{S}$ and all possible forms without taking account of the response variable $Y$. So $\mathcal{M}_n^*$ only depends on the grid $X_1^n$ and is independent of the variables $(Y_1, \ldots, Y_n)$. Hence, for a tree $T \in \mathcal{M}_n^*$, define

$$\mathcal{F}_T = \left\{ \sum_{t \in \tilde T} a_t 1_t \ ;\ (a_t)_{t \in \tilde T} \in \{0, 1\}^{|\tilde T|} \right\},$$

where $\tilde T$ refers to the set of the leaves of $T$. Then, for any $f \in \mathcal{F}_T$ and any $\sigma > 0$, define

$$B_T(f, \sigma) = \left\{ g \in \mathcal{F}_T \ ;\ d_n(f, g) \le \sigma \right\}.$$

For each classifier $f : \mathcal{X} \to \{0, 1\}$, let us define the empirical contrast of $f$ recentered conditionally on $X_1^n$

$$\bar\gamma_n(f) = \gamma_n(f) - \mathbb{E}\left[\gamma_n(f) \mid X_1^n\right]. \qquad (14)$$

Remark 2. If $\gamma_n$ is evaluated on a sample $(X_i')$ independent of $X_1^n$, it is easy to check that the bounds we obtain in what follows are still valid by defining the distance with respect to the marginal distribution of $X$ instead of the empirical distribution.

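We have the following result:

Lemma 1. For any $f \in \mathcal{F}_T$ and any $\sigma > 0$,

$$\mathbb{E}\left[ \sup_{g \in B_T(f, \sigma)} \left|\bar\gamma_n(g) - \bar\gamma_n(f)\right| \ \Big|\ X_1^n \right] \le 2\sigma \sqrt{\frac{|\tilde T|}{n}}.$$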
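The bound of Lemma 1 can be checked by simulation on a toy configuration. The following is a rough Monte Carlo sketch under assumptions that are not in the paper: leaves of equal size, a constant regression function `eta` equal to $P(Y = 1 \mid X)$, and a reference classifier $f \equiv 0$. The function name `lemma1_monte_carlo` is hypothetical.

```python
import itertools
import math
import random


def lemma1_monte_carlo(n_leaves=4, n_per_leaf=10, sigma=0.6, eta=0.3,
                       n_rep=2000, seed=0):
    """Estimate E[sup_g |gbar_n(g) - gbar_n(f)| | X] and return it with the bound.

    Classifiers are coded by their 0/1 value on each leaf, so that
    d_n^2(f, g) = (number of leaves where f and g differ) * n_per_leaf / n.
    """
    rng = random.Random(seed)
    n = n_leaves * n_per_leaf
    f = (0,) * n_leaves
    # Ball B_T(f, sigma): piecewise constant g with d_n(f, g) <= sigma.
    ball = [g for g in itertools.product((0, 1), repeat=n_leaves)
            if math.sqrt(sum((a - b) ** 2 for a, b in zip(g, f))
                         * n_per_leaf / n) <= sigma]
    total = 0.0
    for _ in range(n_rep):
        # Per-leaf sums of the centered signs:
        # (1 - 2*1{Y=1}) - E[1 - 2*1{Y=1} | X] = -2 * (1{Y=1} - eta).
        s = [sum(-2.0 * ((1.0 if rng.random() < eta else 0.0) - eta)
                 for _ in range(n_per_leaf))
             for _ in range(n_leaves)]
        total += max(abs(sum((a - b) * s_t for a, b, s_t in zip(g, f, s))) / n
                     for g in ball)
    return total / n_rep, 2.0 * sigma * math.sqrt(n_leaves / n)


if __name__ == "__main__":
    estimate, bound = lemma1_monte_carlo()
    print(f"Monte Carlo estimate: {estimate:.4f}, Lemma 1 bound: {bound:.4f}")
```

In such small examples the simulated supremum typically sits well below $2\sigma\sqrt{|\tilde T|/n}$, which is expected since the symmetrization and Cauchy-Schwarz steps of the proof are not tight.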



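Proof. First of all, let us mention that, since the different variables we consider take values in $\{0; 1\}$, we have for all $x \in \mathcal{X}$ and all $y \in \{0, 1\}$

$$1_{g(x) \ne y} - 1_{f(x) \ne y} = \left(g(x) - f(x)\right)\left(1 - 2 \cdot 1_{y=1}\right),$$

yielding

$$\bar\gamma_n(g) - \bar\gamma_n(f) = \frac{1}{n} \sum_{i=1}^n \left(g(X_i) - f(X_i)\right)\left(1 - 2 \cdot 1_{Y_i=1}\right) - \mathbb{E}\left[ \frac{1}{n} \sum_{i=1}^n \left(g(X_i) - f(X_i)\right)\left(1 - 2 \cdot 1_{Y_i=1}\right) \ \Big|\ X_1^n \right].$$

Let us now consider a Rademacher sequence of random signs $(\varepsilon_i)_{1 \le i \le n}$ independent of $(X_i, Y_i)_{1 \le i \le n}$. Then one has by a symmetrization argument

$$\mathbb{E}\left[ \sup_{g \in B_T(f, \sigma)} \left|\bar\gamma_n(g) - \bar\gamma_n(f)\right| \ \Big|\ X_1^n \right] \le \frac{2}{n}\, \mathbb{E}\left[ \sup_{g \in B_T(f, \sigma)} \left| \sum_{i=1}^n \varepsilon_i \left(g(X_i) - f(X_i)\right)\left(1 - 2 \cdot 1_{Y_i=1}\right) \right| \ \Big|\ X_1^n \right].$$

Since $g$ and $f$ belong to $\mathcal{F}_T$, we have that

$$g - f = \sum_{t \in \tilde T} (a_t - b_t)\, \varphi_t,$$

where each $(a_t, b_t)$ takes values in $[0, 1]^2$ and $(\varphi_t)_{t \in \tilde T}$ is an orthonormal basis of $\mathcal{F}_T$ adapted to $T$ (i.e. some normalized characteristic functions). Then by applying the Cauchy-Schwarz inequality, since $g \in B_T(f, \sigma)$, i.e. $\|g - f\|_n^2 = d_n^2(f, g) = \sum_{t \in \tilde T} (a_t - b_t)^2 \le \sigma^2$, we obtain that

$$\left| \sum_{i=1}^n \varepsilon_i \left(g(X_i) - f(X_i)\right)\left(1 - 2 \cdot 1_{Y_i=1}\right) \right| = \left| \sum_{t \in \tilde T} (a_t - b_t) \sum_{i=1}^n \varepsilon_i \left(1 - 2 \cdot 1_{Y_i=1}\right) \varphi_t(X_i) \right| \le \sigma \sqrt{ \sum_{t \in \tilde T} \left( \sum_{i=1}^n \varepsilon_i \left(1 - 2 \cdot 1_{Y_i=1}\right) \varphi_t(X_i) \right)^2 }.$$

Finally, since $(\varepsilon_i)_{1 \le i \le n}$ and $(1 - 2 \cdot 1_{Y_i=1})_{1 \le i \le n}$ take their values in $\{-1; 1\}$, $(\varepsilon_i)_{1 \le i \le n}$ are centered and independent of $(X_i, Y_i)_{1 \le i \le n}$, and since for each $t \in \tilde T$, $\|\varphi_t\|_n = 1$, Jensen's inequality implies

$$\mathbb{E}\left[ \sup_{g \in B_T(f, \sigma)} \left|\bar\gamma_n(g) - \bar\gamma_n(f)\right| \ \Big|\ X_1^n \right] \le \frac{2\sigma}{n} \sqrt{ \sum_{t \in \tilde T} \sum_{i=1}^n \varphi_t^2(X_i) } = 2\sigma \sqrt{\frac{|\tilde T|}{n}}.$$

And the proof is achieved.

□

5.2 Proof of Proposition 1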

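To prove Proposition 1 we adapt results of Massart [27, Theorem 4.2], and Massart and Nedelec [29] (see also Massart et al. [28]). Similar methods are used in [33].

Let $n = n_2$. Let us give a sample $\mathcal{L}_2 = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$ of the random variable $(X, Y) \in \mathcal{X} \times [0, 1]$, where $\mathcal{X}$ is a measurable space, and let $f^* \in \mathcal{F} \subset \{f : \mathcal{X} \to [0, 1] \ ;\ f \in L^2(\mathcal{X})\}$ be the unknown function to be recovered. Assume $(\mathcal{F}_m)_{m \in \mathcal{M}_n}$ is a countable collection of countable models included in $\mathcal{F}$. Let us give a penalty function $\mathrm{pen}_n : \mathcal{M}_n \to \mathbb{R}_+$, and $\gamma : \mathcal{F} \times (\mathcal{X} \times [0, 1]) \to \mathbb{R}_+$ a contrast function, i.e. $\gamma$ such that $f \mapsto \mathbb{E}[\gamma(f, (X, Y))]$ is convex and minimum at point $f^*$. Hence define for all $f \in \mathcal{F}$ the expected loss $l(f^*, f) = \mathbb{E}[\gamma(f, (X, Y)) - \gamma(f^*, (X, Y))]$.
Finally let

$$\gamma_n = \frac{1}{n} \sum_{i=1}^n \gamma(\cdot, (X_i, Y_i)) \qquad (15)$$

be the empirical contrast associated with $\gamma$. Let $\hat m$ be defined as

$$\hat m = \operatorname{argmin}_{m \in \mathcal{M}_n} \left[ \gamma_n(\hat f_m) + \mathrm{pen}_n(m) \right],$$

where $\hat f_m = \operatorname{argmin}_{g \in \mathcal{F}_m} \gamma_n(g)$ is the minimum empirical contrast estimator of $f^*$ on $\mathcal{F}_m$. Then the final estimator of $f^*$ is

$$\tilde f = \hat f_{\hat m}. \qquad (16)$$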
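The two-stage definition above (one minimum contrast estimator per model, then a penalized choice of the model) can be sketched as follows. This is an illustration only: the models are identified by hypothetical keys, the empirical contrasts of the per-model minimizers are assumed to be already computed, and `select_penalized` is not a name from the paper.

```python
from typing import Dict, Hashable, Tuple


def select_penalized(
    empirical_contrast: Dict[Hashable, float],
    penalty: Dict[Hashable, float],
) -> Tuple[Hashable, float]:
    """Return the model minimizing gamma_n(f_hat_m) + pen_n(m) and the criterion value.

    empirical_contrast[m] is assumed to hold gamma_n(f_hat_m), the empirical
    contrast of the minimum contrast estimator on model m.
    """
    missing = set(empirical_contrast) - set(penalty)
    if missing:
        raise KeyError(f"no penalty given for models: {sorted(map(str, missing))}")
    criterion = {m: empirical_contrast[m] + penalty[m] for m in empirical_contrast}
    m_hat = min(criterion, key=criterion.get)
    return m_hat, criterion[m_hat]


if __name__ == "__main__":
    # Models indexed by their number of leaves, with a linear penalty.
    contrast = {1: 0.30, 2: 0.20, 4: 0.18, 8: 0.17}
    pen = {k: 0.015 * k for k in contrast}
    print(select_penalized(contrast, pen))
```

With a penalty that is linear in the number of leaves, as in the usage example, the criterion trades a decreasing empirical contrast against an increasing model size.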

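One makes the following assumptions:

H1: $\gamma$ is bounded by 1 (which is not a restriction since all the functions we consider take values in $[0, 1]$).

H2: Assume there exist $c \ge (2\sqrt{2})^{-1/2}$ and some (pseudo-)distance $d$ such that, for every pair $(f, g) \in \mathcal{F}^2$, one has

$$\mathrm{Var}\left[\gamma(g, (X, Y)) - \gamma(f, (X, Y))\right] \le d^2(g, f),$$

and particularly for all $f \in \mathcal{F}$

$$d^2(f^*, f) \le c^2\, l(f^*, f).$$

H3: For any positive $\sigma$ and for any $f \in \mathcal{F}_m$, let us define

$$B_m(f, \sigma) = \left\{ g \in \mathcal{F}_m \ ;\ d(f, g) \le \sigma \right\}$$

where $d$ is given by assumption H2. Let $\bar\gamma_n = \gamma_n(\cdot) - \mathbb{E}[\gamma_n(\cdot)]$. We now assume that for any $m \in \mathcal{M}_n$ there exists some continuous function $\phi_m$ mapping $\mathbb{R}_+$ onto $\mathbb{R}_+$ such that $\phi_m(0) = 0$, $\phi_m(x)/x$ is non-increasing and

$$\mathbb{E}\left[ \sup_{g \in B_m(f, \sigma)} \left|\bar\gamma_n(g) - \bar\gamma_n(f)\right| \right] \le \phi_m(\sigma)$$

for every positive $\sigma$ such that $\phi_m(\sigma) \le \sigma^2$. Let $\varepsilon_m$ be the unique solution of the equation

$$\phi_m(cx) = x^2, \quad x > 0.$$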



One gets the following result:

Theorem 2. Let $\{(X_1, Y_1), \ldots, (X_n, Y_n)\}$ be a sample of independent realizations of the random pair $(X, Y) \in \mathcal{X} \times [0, 1]$. Let $(\mathcal{F}_m)_{m \in \mathcal{M}_n}$ be a countable collection of models included in some countable family $\mathcal{F} \subset \{f : \mathcal{X} \to [0, 1] \ ;\ f \in L^2(\mathcal{X})\}$. Consider some penalty function $\mathrm{pen}_n : \mathcal{M}_n \to \mathbb{R}_+$ and the corresponding penalized estimator $\tilde f$ (16) of the target function $f^*$. Take a family of weights $(x_m)_{m \in \mathcal{M}_n}$ such that

$$\Sigma = \sum_{m \in \mathcal{M}_n} e^{-x_m} < +\infty. \qquad (17)$$

Assume that assumptions H1, H2 and H3 hold.

Let $\xi > 0$. Hence, given some absolute constant $C > 1$, there exist some positive constants $K_1$ and $K_2$ such that, if for all $m \in \mathcal{M}_n$

$$\mathrm{pen}_n(m) \ge K_1 \varepsilon_m^2 + K_2 c^2 \frac{x_m}{n},$$

then, with probability larger than $1 - \Sigma e^{-\xi}$,

$$l(f^*, \tilde f) \le C \inf_{m \in \mathcal{M}_n} \left[ l(f^*, \mathcal{F}_m) + \mathrm{pen}_n(m) \right] + C' c^2\, \frac{1 + \xi}{n},$$

where $l(f^*, \mathcal{F}_m) = \inf_{f_m \in \mathcal{F}_m} l(f^*, f_m)$ and the constant $C'$ only depends on $C$.



Proof. The proof is inspired by Massart [27] and Massart et al. [28]. We give only sketches of proofs since those are now routine results in the model selection area (see [28] for a fuller overview). The interested reader may find the detailed proofs in the first version of the paper [15].

Let $m \in \mathcal{M}_n$ and $f_m \in \mathcal{F}_m$. The definition of the expected loss and the fact that

$$\gamma_n(\tilde f) + \mathrm{pen}_n(\hat m) \le \gamma_n(f_m) + \mathrm{pen}_n(m)$$

lead to the following inequality:

$$l(f^*, \tilde f) \le l(f^*, f_m) + \bar\gamma_n(f_m) - \bar\gamma_n(\tilde f) + \mathrm{pen}_n(m) - \mathrm{pen}_n(\hat m), \qquad (18)$$

where $\bar\gamma_n$ is the recentered empirical contrast of assumption H3. The general principle is now to concentrate $\bar\gamma_n(f_m) - \bar\gamma_n(\tilde f)$ around its expectation in order to offset the term $\mathrm{pen}_n(\hat m)$. Since $\hat m \in \mathcal{M}_n$, we proceed by bounding $\bar\gamma_n(f_m) - \bar\gamma_n(\hat f_{m'})$ uniformly in $m' \in \mathcal{M}_n$. For $m' \in \mathcal{M}_n$ and $f \in \mathcal{F}_{m'}$, let us define

$$w_{m'}(f) = \left[ \sqrt{l(f^*, f_m)} + \sqrt{l(f^*, f)} \right]^2 + y_{m'}^2$$

with $y_{m'} \ge \varepsilon_{m'}$, where $\varepsilon_{m'}$ is defined by assumption H3. Hence let us define

$$V_{m'} = \sup_{f \in \mathcal{F}_{m'}} \frac{\bar\gamma_n(f_m) - \bar\gamma_n(f)}{w_{m'}(f)}.$$

Then (18) becomes

$$l(f^*, \tilde f) \le l(f^*, f_m) + V_{\hat m}\, w_{\hat m}(\tilde f) + \mathrm{pen}_n(m) - \mathrm{pen}_n(\hat m).$$

Since $V_{m'}$ can be written as

$$V_{m'} = \sup_{f \in \mathcal{F}_{m'}} \nu_n\left( \frac{\gamma(f_m, \cdot) - \gamma(f, \cdot)}{w_{m'}(f)} \right),$$

where $\nu_n$ is the recentered empirical measure, we bound $V_{m'}$ uniformly in $m' \in \mathcal{M}_n$ by using Rio's version of Talagrand's inequality recalled here: if $\mathcal{G}$ is a countable family of measurable functions such that, for some positive constants $v$ and $b$, one has for all $f \in \mathcal{G}$, $P(f^2) \le v$ and $\|f\|_\infty \le b$, then for every positive $y$, the following inequality holds for $Z = \sup_{f \in \mathcal{G}} (P_n - P)(f)$:

$$P\left[ Z - \mathbb{E}(Z) \ge \sqrt{\frac{2\left(v + 4b\,\mathbb{E}(Z)\right) y}{n}} + \frac{b y}{n} \right] \le e^{-y}.$$



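To proceed, we need to check the two bounding assumptions. First, since by assumption H1 the contrast $\gamma$ is bounded by 1, we have that, for each $f \in \mathcal{F}_{m'}$,

$$\left\| \frac{\gamma(f, \cdot) - \gamma(f_m, \cdot)}{w_{m'}(f)} \right\|_\infty \le \frac{1}{y_{m'}^2}. \qquad (19)$$

Second, by using assumption H2, we have that, for each $f \in \mathcal{F}_{m'}$,

$$\mathrm{Var}\left[ \frac{\gamma(f, (X, Y)) - \gamma(f_m, (X, Y))}{w_{m'}(f)} \right] \le \frac{c^2}{4 y_{m'}^2}. \qquad (20)$$

Then, by Rio's inequality, we have for every $x > 0$

$$P\left[ V_{m'} \ge \mathbb{E}(V_{m'}) + \sqrt{\frac{c^2 + 16\,\mathbb{E}(V_{m'})}{2 n y_{m'}^2}\, x} + \frac{x}{n y_{m'}^2} \right] \le e^{-x}.$$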



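Let us take $x = x_{m'} + \xi$, $\xi > 0$, where $x_{m'}$ is given by (17). Then by summing up over $m' \in \mathcal{M}_n$, we obtain that for all $m' \in \mathcal{M}_n$

$$V_{m'} \le \mathbb{E}(V_{m'}) + \sqrt{\frac{c^2 + 16\,\mathbb{E}(V_{m'})}{2 n y_{m'}^2}\, (x_{m'} + \xi)} + \frac{x_{m'} + \xi}{n y_{m'}^2}$$

on a set $\Omega_\xi$ such that $P(\Omega_\xi) \ge 1 - \Sigma e^{-\xi}$. We now need to bound $\mathbb{E}(V_{m'})$ in order to obtain an upper bound for $V_{m'}$ on the set of large probability $\Omega_\xi$. By using techniques similar to those of Massart et al. [29], we obtain the following inequality via the monotonicity of $x \mapsto \phi(x)/x$ and the assumption $c \ge (2\sqrt{2})^{-1/2}$: for all $m' \in \mathcal{M}_n$,

$$\mathbb{E}(V_{m'}) \le \frac{8\sqrt{10}\, \varepsilon_{m'} + c\,(2n)^{-1/2}}{y_{m'}}.$$

Hence, taking

$$y_{m'} = K \left[ 8\sqrt{10}\, \varepsilon_{m'} + c\,(2n)^{-1/2} + c \sqrt{\frac{x_{m'} + \xi}{n}} \right]$$

with $K > 0$, we obtain that, on $\Omega_\xi$, for all $m' \in \mathcal{M}_n$,

$$V_{m'} \le K',$$

where $K'$ is a positive constant depending only on $K$.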



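Finally, by using repeatedly the elementary inequality $(\alpha + \beta)^2 \le 2\alpha^2 + 2\beta^2$ to bound $y_{\hat m}^2$ and $w_{\hat m}(\tilde f)$, we derive that the following inequality holds on $\Omega_\xi$ for any $m \in \mathcal{M}_n$ and any $f_m \in \mathcal{F}_m$:

$$(1 - 2K')\, l(f^*, \tilde f) \le (1 + 2K')\, l(f^*, f_m) + \mathrm{pen}_n(m) + 2K'K^2 \frac{c^2}{n} + 2K'K^2 \frac{c^2 \xi}{n} + 5 \times 2^9 K'K^2 \varepsilon_{\hat m}^2 + 2c^2 K'K^2 \frac{x_{\hat m}}{n} - \mathrm{pen}_n(\hat m),$$

where $K'$ is the upper bound obtained for $V_{m'}$, and with

$$K_1 = 5 \times 2^9 K'K^2, \qquad K_2 = 2K'K^2,$$

achieving the proof. □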

Application to classification trees: 

Let us now suppose that $(X, Y)$ takes values in $\mathcal{X} \times \{0, 1\}$. The contrast is taken as $\gamma(f, (X, Y)) = 1_{f(X) \ne Y}$, the expected loss is defined by (3), and the collection of models is $(\mathcal{F}_T)_{T \preceq T_{max}}$. The models and the collection are countable since there is a finite number of functions in each $\mathcal{F}_T$, and a finite number of nodes in $T_{max}$. Since we are working conditionally on $\mathcal{L}_1$, we can apply Theorem 2 directly with $\mathcal{L}_2$. To check assumption H2, let us first note that, since all the variables we consider take values in $\{0, 1\}$, we have the following for all classifiers $f$ and $g$

$$\left(\gamma(f, (X, Y)) - \gamma(g, (X, Y))\right)^2 = \left(1_{Y \ne f(X)} - 1_{Y \ne g(X)}\right)^2 \qquad (21)$$

$$= \left(f(X) - g(X)\right)^2. \qquad (22)$$

Then if we take $d^2(f, g) = \mathbb{E}\left[(f(X) - g(X))^2\right] = \|f - g\|^2$, where $\|.\|$ is the $L^2$-norm with respect to the marginal distribution of $X$, we have that, for all classifiers $f$ and $g$, $\mathrm{Var}[\gamma(g, (X, Y)) - \gamma(f, (X, Y))] \le d^2(f, g)$. Moreover, with the margin condition MA(1a), we have that

$$l(f^*, f) \ge h \|f - f^*\|^2, \qquad (23)$$

hence assumption H2 is checked with $d^2(f, g) = \|f - g\|^2$ and $c^2 = 1/h$, where $h$ is the margin. By definition of $h$, we have $h \le 1 \le 2\sqrt{2}$, and then $c \ge (2\sqrt{2})^{-1/2}$.

Then assumption H3 is checked by Lemma 1 with $\phi_T(x) = 2x\sqrt{|\tilde T|/n}$. Hence Theorem 2 is verified with $\varepsilon_T = \sqrt{1/h}\,\sqrt{|\tilde T|/n}$.

Finally, to choose a convenient family of weights $(x_T)_{T \preceq T_{max}}$, taking $x_T = \theta |\tilde T|$, with $\theta > 2 \log 2$ independent of $|\tilde T|$ as done in p^, we immediately obtain $\Sigma = \sum_{T \preceq T_{max}} e^{-x_T} < +\infty$.
Then we get Proposition 1 by Theorem 2.
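The role of the threshold $2 \log 2$ can be illustrated numerically. The sketch below rests on an assumption that is not stated in the text: the number of pruned subtrees of $T_{max}$ with $D$ leaves is bounded by the number of binary tree shapes with $D$ leaves, i.e. the Catalan number $\frac{1}{D}\binom{2(D-1)}{D-1}$. Under this assumption the weight sum is bounded by a series that converges exactly when $e^{-\theta} \le 1/4$. The function names are hypothetical.

```python
import math


def catalan(k: int) -> int:
    """k-th Catalan number, the number of binary tree shapes with k + 1 leaves."""
    return math.comb(2 * k, k) // (k + 1)


def weight_sum_upper_bound(theta: float, max_leaves: int) -> float:
    """Partial sum of sum_D catalan(D - 1) * exp(-theta * D), D = 1..max_leaves.

    Each term is computed through logarithms so that large Catalan numbers
    do not overflow when converted to floats.
    """
    return sum(math.exp(math.log(catalan(d - 1)) - theta * d)
               for d in range(1, max_leaves + 1))


def weight_sum_closed_form(theta: float) -> float:
    """Limit of the series, from the Catalan generating function.

    Valid only when exp(-theta) <= 1/4, i.e. theta >= 2 log 2.
    """
    x = math.exp(-theta)
    if 4.0 * x > 1.0:
        raise ValueError("the series diverges for theta < 2 log 2")
    return (1.0 - math.sqrt(1.0 - 4.0 * x)) / 2.0


if __name__ == "__main__":
    print(weight_sum_upper_bound(2.0, 200), weight_sum_closed_form(2.0))
```

For $\theta$ below $2 \log 2$ the partial sums grow without bound, which is consistent with the restriction on $\theta$ in the choice of weights.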

5.3 Proof of Proposition 2

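In what follows, we denote by $\mathcal{L}_1$ the sample $\{(X_1, Y_1), \ldots, (X_n, Y_n)\}$ of size $n$ of the random variable $(X, Y)$, and by $X_1^n$ the sample $\{X_1, \ldots, X_n\}$.

First we generalize Theorem 2 to random models, and then we apply it to CART. Let $(X, Y)$, $f^* \in \mathcal{F}$, $\mathcal{L}_1 = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$, $\gamma$ and $\gamma_n$ be defined as in subsection 5.2. Finally let us rewrite the expected loss of $f \in \mathcal{F}$ conditionally on $X_1^n$ as

$$l_n(f^*, f) = \mathbb{E}_Y\left[ \frac{1}{n} \sum_{i=1}^n \left( \gamma(f, (X_i, Y_i)) - \gamma(f^*, (X_i, Y_i)) \right) \right],$$

where $\mathbb{E}_Y$ is the expectation with respect to the marginal distribution of $Y$.
Let us consider a collection of at most countable models $(\mathcal{F}_m)_{m \in \mathcal{M}_n^*}$ and a subcollection $(\mathcal{F}_m)_{m \in \mathcal{M}_n}$, where $\mathcal{M}_n \subset \mathcal{M}_n^*$ may depend on $\{(X_1, Y_1), \ldots, (X_n, Y_n)\}$. Finally let us consider a penalty function $\mathrm{pen}_n : \mathcal{M}_n^* \to \mathbb{R}_+$ and let us define the estimator $\tilde f$ of $f^*$ as follows: let

$$\hat m = \operatorname{argmin}_{m \in \mathcal{M}_n} \left[ \gamma_n(\hat f_m) + \mathrm{pen}_n(m) \right],$$

where $\hat f_m = \operatorname{argmin}_{f \in \mathcal{F}_m} \gamma_n(f)$ is the minimum contrast estimator of $f^*$ on $\mathcal{F}_m$. Then

$$\tilde f = \hat f_{\hat m}.$$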

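Let us make the following assumptions.
H1: $\gamma$ is bounded by 1.

H2: Assume there exist $c \ge (2\sqrt{2})^{-1/2}$ and some (pseudo-)distance $d_n$ (that may depend on $X_1^n$) such that, for every pair $(g, f) \in \mathcal{F}^2$, one has

$$\mathrm{Var}\left[\gamma(g, (X, Y)) - \gamma(f, (X, Y)) \mid X_1^n\right] \le d_n^2(g, f),$$

and particularly for all $f \in \mathcal{F}$

$$d_n^2(f^*, f) \le c^2\, l_n(f^*, f).$$

H3: For any positive $\sigma$ and for any $f \in \mathcal{F}_m$, let us define

$$B_m(f, \sigma) = \left\{ g \in \mathcal{F}_m \ ;\ d_n(f, g) \le \sigma \right\}$$

where $d_n$ is given by assumption H2. Let $\bar\gamma_n$ be defined as (14). We now assume that for any $m \in \mathcal{M}_n^*$, there exists some continuous function $\phi_m$ mapping $\mathbb{R}_+$ onto $\mathbb{R}_+$ such that $\phi_m(0) = 0$, $\phi_m(x)/x$ is non-increasing and

$$\mathbb{E}\left[ \sup_{g \in B_m(f, \sigma)} \left|\bar\gamma_n(g) - \bar\gamma_n(f)\right| \ \Big|\ X_1^n \right] \le \phi_m(\sigma)$$

for every positive $\sigma$ such that $\phi_m(\sigma) \le \sigma^2$. Let $\varepsilon_m$ be the unique solution of the equation $\phi_m(cx) = x^2$, $x > 0$.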

One gets the following result. 

Theorem 3. Let $\mathcal{L}_1 = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$ be a sample of independent realizations of the random pair $(X, Y) \in \mathcal{X} \times [0, 1]$. Let $(\mathcal{F}_m)_{m \in \mathcal{M}_n^*}$ be a countable collection of models included in some countable family $\mathcal{F} \subset \{f : \mathcal{X} \to [0, 1] \ ;\ f \in L^2(\mathcal{X})\}$ (which may depend on $X_1^n$). Consider some subcollection of models $(\mathcal{F}_m)_{m \in \mathcal{M}_n}$, where $\mathcal{M}_n \subset \mathcal{M}_n^*$ may depend on $\mathcal{L}_1$, and some penalty function $\mathrm{pen}_n : \mathcal{M}_n^* \to \mathbb{R}_+$. Let $\tilde f$ be the corresponding penalized estimator of the target function $f^*$. Take a family of weights $(x_m)_{m \in \mathcal{M}_n^*}$ such that

$$\sum_{m \in \mathcal{M}_n^*} e^{-x_m} \le \Sigma < +\infty, \qquad (24)$$

with $\Sigma$ deterministic. Assume that assumptions H1, H2 and H3 hold.

Let $\xi > 0$. Hence, given some absolute constant $C > 1$, there exist some positive constants $K_1$ and $K_2$ such that, if for all $m \in \mathcal{M}_n$

$$\mathrm{pen}_n(m) \ge K_1 \varepsilon_m^2 + K_2 c^2 \frac{x_m}{n},$$

then, with probability larger than $1 - 2\Sigma e^{-\xi}$,

$$l_n(f^*, \tilde f) \le C \inf_{m \in \mathcal{M}_n} \left[ l_n(f^*, \mathcal{F}_m) + \mathrm{pen}_n(m) \right] + C' c^2\, \frac{1 + \xi}{n},$$

where $l_n(f^*, \mathcal{F}_m) = \inf_{f_m \in \mathcal{F}_m} l_n(f^*, f_m)$ and the constant $C'$ only depends on $C$.

Proof. The proof is highly similar to that of Theorem 2. The main differences are in the conditioning and the fact that the collection of models $(\mathcal{F}_m)_{m \in \mathcal{M}_n}$ is random. To handle these issues, all the bounds are computed uniformly on $\mathcal{M}_n^*$ so that the probability of the set we finally obtain is unconditional to $X_1^n$ since $\Sigma$ is deterministic. The inequalities are obtained by the same techniques as the ones used for the proof of the results on model selection on random models done by Gey and Nedelec in [17].

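Let $m \in \mathcal{M}_n$ and $f_m \in \mathcal{F}_m$. Starting from (18), we have

$$l_n(f^*, \tilde f) \le l_n(f^*, f_m) + w_{\hat m, m}(\tilde f)\, V_{\hat m, m} + \mathrm{pen}_n(m) - \mathrm{pen}_n(\hat m), \qquad (25)$$

where for all $m'$ and $M$ in $\mathcal{M}_n^*$, all $f \in \mathcal{F}_{m'}$ and $f_M \in \mathcal{F}_M$:

$$w_{m', M}(f) = \left[ \sqrt{l_n(f^*, f_M)} + \sqrt{l_n(f^*, f)} \right]^2 + (y_{m'} + y_M)^2, \qquad V_{m', M} = \sup_{f \in \mathcal{F}_{m'}} \frac{\bar\gamma_n(f_M) - \bar\gamma_n(f)}{w_{m', M}(f)},$$

with $y_{m'} \ge \varepsilon_{m'}$ and $y_M \ge \varepsilon_M$. The general principle is now exactly the same as in the proof of Theorem 2, except that we have to bound $V_{m', M}$ not only uniformly in $m' \in \mathcal{M}_n^*$ but also in $M \in \mathcal{M}_n^*$ in order to have an in-probability inequality that does not depend on $X_1^n$.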

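Assumption H2 gives exactly the same upper bounds as (19) and (20), except that they depend on $X_1^n$ and that $y_{m'}$ is replaced by $y_{m'} + y_M$. By using the same techniques as in the proof of Theorem 2 and the same considerations as in [17], we obtain that

$$V_{m', M} \le \frac{8\sqrt{10}\,(\varepsilon_{m'} + \varepsilon_M) + c\,(2n)^{-1/2}}{y_{m'} + y_M} + \sqrt{ \frac{c^2 + 16 \left( 8\sqrt{10}\,(\varepsilon_{m'} + \varepsilon_M) + c\,(2n)^{-1/2} \right) (y_{m'} + y_M)^{-1}}{2 n\, (y_{m'} + y_M)^2} \left( x_{m'} + x_M + \xi \right) } + \frac{1}{(y_{m'} + y_M)^2} \left( \frac{x_{m'} + \xi/2}{n} + \frac{x_M + \xi/2}{n} \right)$$

on a set $\Omega_\xi$ such that $P(\Omega_\xi \mid X_1^n) \ge 1 - 2\Sigma e^{-\xi}$. Then, since $\Sigma$ is deterministic, we get that $P(\Omega_\xi) \ge 1 - 2\Sigma e^{-\xi}$.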

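Hence, choosing $y_{m'}$ for all $m' \in \mathcal{M}_n^*$ as in the proof of Theorem 2, with a multiplicative constant $K > 0$, we obtain that, on $\Omega_\xi$, $V_{m', M}$ is bounded for all $m'$ and $M$ in $\mathcal{M}_n^*$ by a constant depending only on $K$.

Finally the proof is achieved in the same way as the proof of Theorem 2.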



□ 



Application to classification trees:

Let us consider the classification framework and the collection of models $(\mathcal{F}_T)_{T \preceq T_{max}}$ obtained via the growing procedure in CART (see subsection 3.1) as recalled in subsection 5.2. Since the growing and the pruning procedures are made on the same sample $\mathcal{L}_1$, the conditions of Theorem 3 hold. Since $n_1$ is fixed, let us consider $\mathcal{M}_{n_1}^*$ as the set of all possible tree-structured partitions that can be constructed on the grid $X_1^{n_1}$, corresponding to trees having all possible splits in $\mathcal{S}$ and all possible forms without taking account of the response variable $Y$. So $\mathcal{M}_{n_1}^*$ depends only on the grid $X_1^{n_1}$ and is independent of the variables $(Y_1, \ldots, Y_{n_1})$. Then $\{T \preceq T_{max}\} \subset \mathcal{M}_{n_1}^*$ and we are able to apply Theorem 3. Considering (21), we take $d_n(g, f) = \|f - g\|_n$, where $\|.\|_n$ is the empirical norm on $X_1^n$. Using the margin condition MA(1b), (23) is also verified for $l_n$ and $d_n$, and we have assumption H2 with $c^2 = 1/h$. Then, by Lemma 1, assumption H3 is checked with $\phi_T(x) = 2x\sqrt{|\tilde T|/n}$ and, in the same way as in the proof of Proposition 1, $\varepsilon_T$ is taken as $\varepsilon_T = \sqrt{1/h}\,\sqrt{|\tilde T|/n}$.

Finally, to choose a convenient family of weights $(x_T)_{T \in \mathcal{M}_n^*}$, taking (see [17])

$$x_T = V\left(\theta + \log\frac{n}{V}\right) |\tilde T|,$$

where $V$ is the VC-dimension of the set of splits $\mathcal{S}$ used to construct $T_{max}$ and $\theta > 1$, we obtain

$$\Sigma_\theta = \sum_{T \in \mathcal{M}_n^*} e^{-x_T} \le \sum_{D \ge 1} \exp\left(-(\theta - 1) D V\right) < +\infty.$$

And we have Proposition 2.
5.4 Proof of Proposition 3

Proposition 3 is a direct application of the theorem obtained by Boucheron, Bousquet and Massart [7] recalled here: assume that we observe $N + n$ independent random variables with common distribution $P$ depending on a parameter $f^*$ to be estimated. Suppose the first $N$ observations $Z' = Z_1', \ldots, Z_N'$ are used to build some preliminary collection of estimators $(\hat f_m)_{m \in \mathcal{M}_n}$ and the remaining observations $Z_1, \ldots, Z_n$ are used to select an estimator $\tilde f$ among this collection by minimizing the empirical contrast as defined by (15) (with $(X, Y)$ replaced by $Z$). Hence we have the following result.

Theorem 5.4.1 (Boucheron, Bousquet, Massart [7]).

Suppose that $\mathcal{M}_n$ is finite with cardinal $K$. Assume that there exists some continuous function $w$ mapping $\mathbb{R}_+$ onto $\mathbb{R}_+$ such that $x \mapsto w(x)/x$ is nonincreasing, and which satisfies for all $\varepsilon > 0$

$$\sup_{l(f^*, f) \le \varepsilon^2} \mathrm{Var}\left[\gamma(f, Z) - \gamma(f^*, Z)\right] \le w^2(\varepsilon). \qquad (26)$$

Then one has for every $\theta \in (0, 1)$

$$(1 - \theta)\, \mathbb{E}\left[ l(f^*, \tilde f) \mid Z' \right] \le (1 + \theta) \inf_{m \in \mathcal{M}_n} l(f^*, \hat f_m) + \delta_*^2 \left( 2\theta + (1 + \log K)\left(\frac{1}{3} + \frac{1}{\theta}\right) \right),$$

where $l$ is the expected loss and $\delta_*$ satisfies $\sqrt{n}\, \delta_*^2 = w(\delta_*)$.
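To see what this oracle inequality gives when $w$ is linear, the sketch below evaluates its right-hand side for $w(\varepsilon) = \varepsilon/\sqrt{h}$, in which case $\sqrt{n}\,\delta_*^2 = w(\delta_*)$ yields $\delta_*^2 = 1/(nh)$. It assumes the remainder term has the form $\delta_*^2\,(2\theta + (1 + \log K)(1/3 + 1/\theta))$; the function names and the default $\theta = 1/2$ are hypothetical choices made for illustration.

```python
import math


def delta_star_squared(n: int, h: float) -> float:
    """Solution of sqrt(n) * delta^2 = w(delta) for w(eps) = eps / sqrt(h)."""
    if n <= 0 or not 0.0 < h <= 1.0:
        raise ValueError("expects n >= 1 and a margin h in (0, 1]")
    return 1.0 / (n * h)


def holdout_bound(best_loss: float, n: int, k: int, h: float,
                  theta: float = 0.5) -> float:
    """Upper bound on the conditional risk of the hold-out selected estimator.

    best_loss stands for the infimum of the expected loss over the K
    candidates; the remainder term follows the assumed form of the theorem.
    """
    if not 0.0 < theta < 1.0:
        raise ValueError("theta must lie in (0, 1)")
    remainder = delta_star_squared(n, h) * (
        2.0 * theta + (1.0 + math.log(k)) * (1.0 / 3.0 + 1.0 / theta))
    return ((1.0 + theta) * best_loss + remainder) / (1.0 - theta)


if __name__ == "__main__":
    # The remainder decreases like log(K) / (n * h).
    for n in (100, 1000, 10000):
        print(n, holdout_bound(best_loss=0.05, n=n, k=20, h=0.5))
```

The remainder scales like $h^{-1} \log K / n$, which is the order of the additional terms appearing in Proposition 3.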

Taking $w(\varepsilon) = (1/\sqrt{h})\, \varepsilon$ for both methods M1 and M2, where $h$ is the margin, leads to Proposition 3 with

$$C'' = \frac{1 + \theta}{1 - \theta}, \qquad C_1' = \frac{\theta + 3}{3\theta(1 - \theta)}, \qquad C_2' = \frac{2\theta}{1 - \theta} + C_1'.$$

5.5 Proof of Theorem 1

We are now able to prove Theorem 1 via Propositions 1, 2 and 3. The beginning of the proof remains the same if $\tilde f$ is constructed either via M1 or M2. So we just give the first step of the proof for the M1 method.

Actually, since we have at most one model per dimension in the pruned subtree sequence, it suffices to note that $K \le n_1$. Then let $\alpha_0$ be the minimal constant given by Proposition 1. Hence, since for a given $\alpha > \alpha_0$, $T_\alpha$ belongs to the sequence $(T_k)_{1 \le k \le K}$, the infimum in Proposition 3 is at most $\lambda(f^*, \hat f_{T_\alpha})$, and $\log K \le \log n_1$.

Starting from this inequality, if $\tilde f$ is constructed via M1, by using Proposition 1 with $\alpha = 2\alpha_0$ and by taking the expectation according to $\mathcal{L}_2$, we obtain Theorem 1 with the appropriate constants.

Yet, if $\tilde f$ is constructed via M2, we apply Proposition 2 with $\alpha = 2\alpha_0 a_{n_1,V}$ and, for each $\delta \in ]0; 1[$, $\xi = \log(2\Sigma_\alpha/\delta)$. Then we obtain Theorem 1 with the appropriate constants.